value estimator
- North America > United States > North Carolina (0.04)
- Europe > United Kingdom > England > Greater London > London (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
- Information Technology > Data Science (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.67)
Generative Large-Scale Pre-trained Models for Automated Ad Bidding Optimization
Lei, Yu, Zhao, Jiayang, Zhao, Yilei, Zhang, Zhaoqi, Cai, Linyou, Xie, Qianlong, Wang, Xingxing
Modern auto-bidding systems must balance overall performance with diverse advertiser goals and real-world constraints, reflecting the dynamic and evolving needs of the industry. Recent advances in conditional generative models, such as transformers and diffusion models, have enabled direct trajectory generation tailored to advertiser preferences, offering a promising alternative to traditional Markov Decision Process-based methods. However, these generative methods face significant challenges: the distribution shift between offline and online environments, limited exploration of the action space, and the need to satisfy constraints such as marginal Cost-per-Mille (CPM) and Return on Investment (ROI). To tackle these challenges, we propose GRAD (Generative Reward-driven Ad-bidding with Mixture-of-Experts), a scalable foundation model for auto-bidding that combines an Action-Mixture-of-Experts module for diverse bidding-action exploration with a Causal Transformer Value Estimator for constraint-aware optimization. Extensive offline and online experiments demonstrate that GRAD significantly enhances platform revenue, highlighting its effectiveness in addressing the evolving and diverse requirements of modern advertisers. Furthermore, GRAD has been deployed in multiple marketing scenarios at Meituan, one of the world's largest online food delivery platforms, yielding a 2.18% increase in Gross Merchandise Value (GMV) and a 10.68% increase in ROI.
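The abstract does not spell out the Action-Mixture-of-Experts architecture, but the general pattern is familiar: several expert heads each propose a candidate action, and a gating network mixes their proposals. A minimal sketch of that pattern follows; all names, shapes, and weights are illustrative assumptions, not GRAD's actual design.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class ActionMoE:
    """Toy action mixture-of-experts head: each expert proposes a
    candidate bid from the state features, and a gating network
    produces mixture weights over the experts' proposals."""

    def __init__(self, state_dim, n_experts=4):
        self.W_gate = rng.normal(size=(state_dim, n_experts))
        self.w_experts = rng.normal(size=(n_experts, state_dim))

    def bid(self, state):
        gate = softmax(state @ self.W_gate)   # (n_experts,) mixture weights
        proposals = self.w_experts @ state    # one scalar bid per expert
        return float(gate @ proposals)        # gated mixture bid
```

Because each expert sees the full state but specializes through the gate, such a head can cover more of the action space than a single deterministic policy, which is the exploration benefit the abstract alludes to.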
- Asia > China > Beijing > Beijing (0.05)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- Asia > South Korea > Seoul > Seoul (0.04)
- (2 more...)
- Research Report (0.64)
- Instructional Material (0.46)
- Marketing (0.94)
- Information Technology > Services (0.67)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.71)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)
- North America > Canada > Quebec > Montreal (0.14)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > United States > California > San Mateo County > Menlo Park (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
A Look at Value-Based Decision-Time vs. Background Planning Methods Across Different Settings
In model-based reinforcement learning (RL), an agent can leverage a learned model to improve its behavior in several ways. Two prevalent approaches are decision-time and background planning. In this study, we are interested in how the value-based versions of these two planning methods compare across different settings. Towards this goal, we first consider the simplest instantiations of value-based decision-time and background planning and provide theoretical results on which will perform better in the regular RL and transfer learning settings. We then consider their modern instantiations and provide hypotheses on which will perform better in the same settings. Finally, we perform illustrative experiments to validate these theoretical results and hypotheses. Overall, our findings suggest that although the two planning methods perform on par in their simplest instantiations, modern value-based decision-time planning methods can perform on par with or better than modern value-based background planning methods in both the regular RL and transfer learning settings.
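The contrast between the two families can be made concrete on a toy problem (this is an illustration of the general idea, not the paper's experimental setup): background planning uses the model between decisions to improve a cached value function, while decision-time planning runs a depth-limited lookahead with the model at the moment an action is needed.

```python
import numpy as np

# Toy deterministic chain MDP: states 0..4, actions {0: left, 1: right},
# reward 1 for arriving at state 4. The "learned model" here is just the
# true dynamics, to keep the contrast between the two methods visible.
N, GAMMA = 5, 0.9

def model(s, a):
    s2 = min(s + 1, N - 1) if a == 1 else max(s - 1, 0)
    return s2, float(s2 == N - 1)

# Background planning (Dyna-style): sweep simulated transitions to
# improve a cached value function, then act greedily against it.
V = np.zeros(N)
for _ in range(100):
    for s in range(N):
        V[s] = max(r + GAMMA * V[s2] for s2, r in (model(s, a) for a in (0, 1)))

def act_background(s):
    qs = {a: r + GAMMA * V[s2] for a in (0, 1) for s2, r in [model(s, a)]}
    return max(qs, key=qs.get)

# Decision-time planning: no cached values; run a depth-limited
# lookahead with the model when an action is needed.
def lookahead(s, depth):
    if depth == 0:
        return 0.0
    return max(r + GAMMA * lookahead(s2, depth - 1)
               for s2, r in (model(s, a) for a in (0, 1)))

def act_decision_time(s, depth=4):
    qs = {a: r + GAMMA * lookahead(s2, depth - 1)
          for a in (0, 1) for s2, r in [model(s, a)]}
    return max(qs, key=qs.get)
```

Both agents walk right from state 0; the difference is where the computation happens (ahead of time vs. per decision), and the lookahead depth caps how far decision-time planning can "see" without a cached value function.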
- North America > Canada > Quebec > Montreal (0.28)
- North America > Canada > Alberta (0.04)
- North America > Barbados (0.04)
- Asia > Japan > Honshū > Chūbu > Toyama Prefecture > Toyama (0.04)
Value Augmented Sampling for Language Model Alignment and Personalization
Han, Seungwook, Shenfeld, Idan, Srivastava, Akash, Kim, Yoon, Agrawal, Pulkit
Aligning Large Language Models (LLMs) to cater to different human preferences, learn new skills, and unlearn harmful behavior is an important problem. Search-based methods, such as Best-of-N or Monte-Carlo Tree Search, are performant but impractical for LLM adaptation due to their high inference cost. On the other hand, using Reinforcement Learning (RL) for adaptation is computationally efficient but performs worse due to the optimization challenges of co-training the value function and the policy. We present Value Augmented Sampling (VAS), a new framework for reward optimization that can maximize different reward functions using data sampled only from the initial, frozen LLM. VAS solves for the optimal reward-maximizing policy without co-training the policy and the value function, which keeps the optimization stable; it outperforms established baselines such as PPO and DPO on standard benchmarks and achieves results comparable to Best-of-128 at lower inference cost. Unlike existing RL methods, which require changing the weights of the LLM, VAS does not need access to the weights of the pre-trained model, so it can even adapt LLMs that are available only as APIs (e.g., ChatGPT). In addition, our algorithm unlocks the new capability of composing several rewards and controlling the extent of each at deployment time, paving the road ahead for the future of aligned, personalized LLMs.
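The core mechanism admits a very small sketch: keep the base model frozen and reweight its next-token distribution by a separately estimated value, p(a) ∝ p_base(a) · exp(β · Q(a)). The function below is only a rough illustration of that reweighting; the stand-in Q array and names are assumptions, and the paper's value model is a trained network, not a given vector.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def value_augmented_sample(base_logits, q_values, beta=1.0, seed=0):
    """Sample a token id from the frozen base policy reweighted by a
    value estimate: p(a) ∝ p_base(a) * exp(beta * Q(a)).  Adding
    beta*Q to the logits before the softmax implements exactly this
    product of base probability and exponentiated value."""
    probs = softmax(np.asarray(base_logits) + beta * np.asarray(q_values))
    rng = np.random.default_rng(seed)
    return int(rng.choice(len(probs), p=probs))
```

Because only the logits are adjusted at sampling time, the base model's weights are never touched, which is what makes the approach applicable to API-only models; composing rewards then amounts to summing several value terms with per-reward β coefficients.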
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- Europe > Switzerland (0.04)
- Asia > Singapore (0.04)
- Asia > Middle East > Jordan (0.04)
- Health & Medicine (0.67)
- Government (0.67)
- Automobiles & Trucks (0.67)
- Education (0.66)
How Can LLM Guide RL? A Value-Based Approach
Zhang, Shenao, Zheng, Sirui, Ke, Shuqi, Liu, Zhihan, Jin, Wanxin, Yuan, Jianbo, Yang, Yingxiang, Yang, Hongxia, Wang, Zhaoran
Reinforcement learning (RL) has become the de facto standard for sequential decision-making problems, improving future acting policies with feedback. However, RL algorithms may require extensive trial-and-error interactions to collect useful feedback. On the other hand, recent large language models (LLMs) have showcased impressive capabilities in language understanding and generation, yet they fall short in exploration and self-improvement for planning tasks, lacking the ability to autonomously refine their responses based on feedback. In this paper, we therefore study how the policy prior provided by an LLM can enhance the sample efficiency of RL algorithms. Specifically, we develop LINVIT, an algorithm that incorporates LLM guidance as a regularization factor in value-based RL. This leads to significant reductions in the amount of data needed for learning, particularly when the gap between the ideal policy and the LLM-informed policy is small: the initial policy is then close to optimal, reducing the need for further exploration. Additionally, we present SLINVIT, a practical algorithm that simplifies the construction of the value function and employs subgoals to reduce the search complexity. Our experiments across three interactive environments (ALFWorld, InterCode, and BlocksWorld) demonstrate that our method achieves state-of-the-art success rates and surpasses previous RL and LLM approaches in sample efficiency. Our code is available at https://github.com/agentification/Language-Integrated-VI.
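One standard way to use a policy prior as a regularizer in value-based RL is a KL-regularized (soft) Bellman backup, where the prior tilts action selection toward its suggestions unless the value estimates strongly disagree. The sketch below shows that generic backup on a made-up two-state MDP; the prior, rewards, and temperature λ are all invented for illustration, and LINVIT's actual algorithm differs in its details.

```python
import numpy as np

# Generic KL-regularized soft value iteration with a policy prior:
#   V(s) = λ log Σ_a π_prior(a|s) exp(Q(s,a)/λ),  Q(s,a) = r(s,a) + γ V(s').
# The prior plays the role an LLM-suggested policy would play.
GAMMA, LAM = 0.9, 0.5
R = np.array([[0.0, 1.0], [0.5, 0.0]])      # reward for (state, action)
P = np.array([[0, 1], [1, 0]])              # deterministic next-state index
prior = np.array([[0.9, 0.1], [0.5, 0.5]])  # made-up "LLM" action prior

V = np.zeros(2)
for _ in range(200):
    Q = R + GAMMA * V[P]                    # Bellman targets under the model
    V = LAM * np.log((prior * np.exp(Q / LAM)).sum(axis=1))  # soft backup

# The regularized policy tilts the prior toward high-value actions.
policy = prior * np.exp(Q / LAM)
policy /= policy.sum(axis=1, keepdims=True)
```

In state 0 the prior puts only 0.1 mass on the rewarding action, yet the value term pulls the regularized policy above that; with a large λ the policy collapses back to the prior, which is the data-efficiency trade-off the abstract describes.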
- Europe > Poland > Podlaskie Province > Bialystok (0.04)
- North America > United States > Arizona (0.04)
- Asia > China > Hong Kong (0.04)
Off-policy Evaluation in Doubly Inhomogeneous Environments
Bian, Zeyu, Shi, Chengchun, Qi, Zhengling, Wang, Lan
Reinforcement learning (RL; Sutton and Barto, 2018) aims to optimize an agent's long-term reward by learning an optimal policy that determines the best action to take under every circumstance. RL is closely related to the dynamic treatment regimens (DTRs), or adaptive treatment strategies, studied in statistical research for precision medicine (Murphy, 2003; Robins, 2004; Qian and Murphy, 2011; Kosorok and Moodie, 2015; Tsiatis et al., 2019; Qi et al., 2020; Zhou et al., 2022a), which seek the treatment policy that maximizes patients' expected outcome in finite-horizon settings with a few treatment stages. Nevertheless, the statistical methods for DTRs mentioned above normally cannot handle large or infinite-horizon settings: they require the number of trajectories to tend to infinity to achieve estimation consistency, unlike RL, which works even with a finite number of trajectories under certain conditions. Beyond precision medicine, RL has been applied to various fields, such as games (Silver et al., 2016), ridesharing (Xu et al., 2018), mobile health (Liao et al., 2021), and robotics (Levine et al., 2020).
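For readers new to off-policy evaluation, the textbook starting point is the one-step importance-sampling (IS) estimator: reweight each logged reward by ρ = π_target(a|s) / π_behavior(a|s) to estimate the target policy's value from behavior-policy data. The sketch below uses an invented two-action bandit; the paper itself tackles the much harder doubly inhomogeneous sequential setting, which this toy does not capture.

```python
import numpy as np

rng = np.random.default_rng(1)

P_B = 0.3  # behavior (logging) policy: plays action 1 with probability 0.3

def is_estimate(n=50_000):
    """One-step IS estimate of the target policy's value.
    Target policy: always play action 1, whose true reward is 1,
    so the estimator should converge to 1.0."""
    total = 0.0
    for _ in range(n):
        a = 1 if rng.random() < P_B else 0          # logged action
        r = 1.0 if a == 1 else 0.0                  # logged reward
        pi_t = 1.0 if a == 1 else 0.0               # π_target(a)
        pi_b = P_B if a == 1 else 1 - P_B           # π_behavior(a)
        total += (pi_t / pi_b) * r                  # reweighted reward
    return total / n
```

The estimator is unbiased but its variance blows up when the two policies diverge, which is one reason the sequential, nonstationary settings studied in the paper need more refined machinery.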
- North America > United States > New York (0.04)
- Asia > Middle East > Jordan (0.04)
- North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
- (4 more...)
- Health & Medicine > Pharmaceuticals & Biotechnology (0.93)
- Health & Medicine > Therapeutic Area (0.68)
- Transportation > Ground > Road (0.34)
- Information Technology > Data Science (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)